Objective

The client is a meal delivery company that operates in multiple cities. It has various fulfillment centers in these cities for dispatching meal orders to its customers. The client wants help with demand forecasting for the upcoming weeks so that the centers can plan their stock of raw materials accordingly.

Most raw materials are replenished on a weekly basis, and since the raw materials are perishable, procurement planning is of utmost importance. Secondly, staffing of the centers is another area where accurate demand forecasts are very helpful. Given the following information, the task is to predict the demand for the next 10 weeks (weeks 146-155) for the center-meal combinations in the test set.

Data Dictionary

Weekly Demand data (train.csv): contains the historical demand data for all centers; test.csv contains all of the following features except the target variable.

Variable Definition

id Unique ID

week Week No

center_id Unique ID for fulfillment center

meal_id Unique ID for Meal

checkout_price Final price including discount, taxes & delivery charges

base_price Base price of the meal

emailer_for_promotion Emailer sent for promotion of meal

homepage_featured Meal featured at homepage

num_orders (Target) Orders Count

fulfilment_center_info.csv: Contains information for each fulfilment center

Variable Definition

center_id Unique ID for fulfillment center

city_code Unique code for city

region_code Unique code for region

center_type Anonymized center type

op_area Area of operation (in km^2)

meal_info.csv: Contains information for each meal being served

Variable Definition

meal_id Unique ID for the meal

category Type of meal (beverages/snacks/soups….)

cuisine Meal cuisine (Indian/Italian/…)
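The three files form a small star schema: train.csv and test.csv are the fact tables, keyed to meal_info.csv by `meal_id` and to fulfilment_center_info.csv by `center_id`. A quick sanity check (a minimal sketch with a hypothetical helper, not part of the original notebook) can confirm that every key in the demand data has a matching lookup row:

```python
def missing_keys(fact_ids, dim_ids):
    # IDs that appear in the demand (fact) data but have no row in the
    # lookup (dimension) table; an empty set means the left joins used
    # later will not introduce NaNs.
    return set(fact_ids) - set(dim_ids)
```

For example, `missing_keys(train_raw['meal_id'], meal['meal_id'])` should return an empty set if the data dictionary above is complete.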

In [1]:
# Importing Libraries

import numpy as np # linear algebra
import pandas as pd # data processing
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor


import warnings
warnings.filterwarnings('ignore')



        
import os
#for dirname, _, filenames in os.walk('\input'):
#    for filename in filenames:
#        print(os.path.join(dirname, filename))
In [2]:
pd.read_csv('input/train.csv')
Out[2]:
id week center_id meal_id checkout_price base_price emailer_for_promotion homepage_featured num_orders
0 1379560 1 55 1885 136.83 152.29 0 0 177
1 1466964 1 55 1993 136.83 135.83 0 0 270
2 1346989 1 55 2539 134.86 135.86 0 0 189
3 1338232 1 55 2139 339.50 437.53 0 0 54
4 1448490 1 55 2631 243.50 242.50 0 0 40
... ... ... ... ... ... ... ... ... ...
456543 1271326 145 61 1543 484.09 484.09 0 0 68
456544 1062036 145 61 2304 482.09 482.09 0 0 42
456545 1110849 145 61 2664 237.68 321.07 0 0 501
456546 1147725 145 61 2569 243.50 313.34 0 0 729
456547 1361984 145 61 2490 292.03 290.03 0 0 162

456548 rows × 9 columns

In [3]:
# Importing Raw files
train_raw = pd.read_csv('input/train.csv')
test_raw = pd.read_csv('input/test.csv')
meal = pd.read_csv('input/meal_info.csv')
centerinfo = pd.read_csv('input/fulfilment_center_info.csv')

Exploratory Data Analysis

In [4]:
print("The Shape of Demand dataset :",train_raw.shape)
print("The Shape of Fulfilment Center Information dataset :",centerinfo.shape)
print("The Shape of Meal information dataset :",meal.shape)
print("The Shape of Test dataset :",test_raw.shape)
The Shape of Demand dataset : (456548, 9)
The Shape of Fulfilment Center Information dataset : (77, 5)
The Shape of Meal information dataset : (51, 3)
The Shape of Test dataset : (32573, 8)

Check the content of each file

In [5]:
train_raw.head()
Out[5]:
id week center_id meal_id checkout_price base_price emailer_for_promotion homepage_featured num_orders
0 1379560 1 55 1885 136.83 152.29 0 0 177
1 1466964 1 55 1993 136.83 135.83 0 0 270
2 1346989 1 55 2539 134.86 135.86 0 0 189
3 1338232 1 55 2139 339.50 437.53 0 0 54
4 1448490 1 55 2631 243.50 242.50 0 0 40
In [6]:
centerinfo.head()
Out[6]:
center_id city_code region_code center_type op_area
0 11 679 56 TYPE_A 3.7
1 13 590 56 TYPE_B 6.7
2 124 590 56 TYPE_C 4.0
3 66 648 34 TYPE_A 4.1
4 94 632 34 TYPE_C 3.6
In [7]:
meal.head()
Out[7]:
meal_id category cuisine
0 1885 Beverages Thai
1 1993 Beverages Thai
2 2539 Beverages Thai
3 1248 Beverages Indian
4 2631 Beverages Indian
In [8]:
test_raw.head()
Out[8]:
id week center_id meal_id checkout_price base_price emailer_for_promotion homepage_featured
0 1028232 146 55 1885 158.11 159.11 0 0
1 1127204 146 55 1993 160.11 159.11 0 0
2 1212707 146 55 2539 157.14 159.14 0 0
3 1082698 146 55 2631 162.02 162.02 0 0
4 1400926 146 55 1248 163.93 163.93 0 0

Check for missing values

In [9]:
train_raw.isnull().sum().sum()
Out[9]:
0
In [10]:
test_raw.isnull().sum().sum()
Out[10]:
0

Observation : No missing values found in the train or test data

In [11]:
print("The company has", centerinfo["center_id"].nunique(), "fulfillment centers",
    "spread across", centerinfo["city_code"].nunique(), "cities and",
    centerinfo["region_code"].nunique(), "regions")

print("The company serves", meal["meal_id"].nunique(), "unique meals, divided into",
      meal["category"].nunique(), "categories and", meal["cuisine"].nunique(), "cuisines")
The company has 77 fulfillment centers spread across 51 cities and 8 regions
The company serves 51 unique meals, divided into 14 categories and 4 cuisines

Merge meal, centerinfo data with train and test data

In [12]:
#Merge train data with meal and center info
train = pd.merge(train_raw, meal, on="meal_id", how="left")
train = pd.merge(train, centerinfo, on="center_id", how="left")
print("Shape of train data : ", train.shape)
train.head()
Shape of train data :  (456548, 15)
Out[12]:
id week center_id meal_id checkout_price base_price emailer_for_promotion homepage_featured num_orders category cuisine city_code region_code center_type op_area
0 1379560 1 55 1885 136.83 152.29 0 0 177 Beverages Thai 647 56 TYPE_C 2.0
1 1466964 1 55 1993 136.83 135.83 0 0 270 Beverages Thai 647 56 TYPE_C 2.0
2 1346989 1 55 2539 134.86 135.86 0 0 189 Beverages Thai 647 56 TYPE_C 2.0
3 1338232 1 55 2139 339.50 437.53 0 0 54 Beverages Indian 647 56 TYPE_C 2.0
4 1448490 1 55 2631 243.50 242.50 0 0 40 Beverages Indian 647 56 TYPE_C 2.0
In [13]:
#Merge test data with meal and center info
test= pd.merge(test_raw, meal, on="meal_id", how="outer")
test = pd.merge(test, centerinfo, on="center_id", how="outer")
print("Shape of test data : ", test.shape)
test.head()
Shape of test data :  (32573, 14)
Out[13]:
id week center_id meal_id checkout_price base_price emailer_for_promotion homepage_featured category cuisine city_code region_code center_type op_area
0 1028232 146 55 1885 158.11 159.11 0 0 Beverages Thai 647 56 TYPE_C 2.0
1 1262649 147 55 1885 159.11 159.11 0 0 Beverages Thai 647 56 TYPE_C 2.0
2 1453211 149 55 1885 157.14 158.14 0 0 Beverages Thai 647 56 TYPE_C 2.0
3 1262599 150 55 1885 159.14 157.14 0 0 Beverages Thai 647 56 TYPE_C 2.0
4 1495848 151 55 1885 160.11 159.11 0 0 Beverages Thai 647 56 TYPE_C 2.0
In [14]:
# Typecasting to Assign appropriate data type to variables 

col_names=['center_id','meal_id','category','cuisine','city_code','region_code','center_type']
train[col_names] = train[col_names].astype('category')
test[col_names] = test[col_names].astype('category')

print("Train Datatype\n",train.dtypes)
print("Test Datatype\n",test.dtypes)
Train Datatype
 id                          int64
week                        int64
center_id                category
meal_id                  category
checkout_price            float64
base_price                float64
emailer_for_promotion       int64
homepage_featured           int64
num_orders                  int64
category                 category
cuisine                  category
city_code                category
region_code              category
center_type              category
op_area                   float64
dtype: object
Test Datatype
 id                          int64
week                        int64
center_id                category
meal_id                  category
checkout_price            float64
base_price                float64
emailer_for_promotion       int64
homepage_featured           int64
category                 category
cuisine                  category
city_code                category
region_code              category
center_type              category
op_area                   float64
dtype: object

Observation: Identifier and categorical columns are now stored as the `category` dtype in both train and test

In [15]:
# Orders by centers
center_orders=train.groupby("center_id",as_index=False).sum()
center_orders=center_orders[["center_id","num_orders"]].sort_values(by="num_orders",ascending=False).head(10)

fig=px.bar(x=center_orders["center_id"].astype("str"),y=center_orders["num_orders"],
           title="Top 10 Centers by Order",labels={"x":"center_id","y":"num_orders"})
fig.show()

Observation : Center 13 has the most orders, followed by centers 43 and 10

In [16]:
#Pie chart on food category
# Note: names must come from the value_counts index, not unique(),
# so that labels line up with their counts
cat_counts = train["category"].value_counts()
fig = px.pie(values=cat_counts.values, names=cat_counts.index,
             title="Most popular food category")
fig.show()

Beverages is by far the most popular food category

In [17]:
# Orders by Cuisine type

cuisine_orders=train.groupby(["cuisine"],as_index=False).sum()
cuisine_orders=cuisine_orders[["cuisine","num_orders"]].sort_values(by="num_orders",ascending=False)
fig=px.bar(cuisine_orders,x="cuisine",y="num_orders", title="orders by cuisine")
fig.show()

Italian is the most popular cuisine

In [18]:
# Impact of check-out price on order
train_sample = train.sample(frac=0.2)
fig=px.scatter(train_sample,x="checkout_price",y="num_orders",title="number of order change with checkout price")
fig.show()
In [19]:
sns.boxplot(x=train["checkout_price"])
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f84d97ab88>

Observation :

  1. The majority of orders are placed at a checkout price below 450
  2. Outliers are observed above this range
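The outlier call above can be made concrete with Tukey's IQR fences (a sketch; the 1.5 multiplier is the conventional choice, not something fixed by this notebook):

```python
import numpy as np

def iqr_outlier_bounds(values, k=1.5):
    # Tukey fences: points beyond k*IQR from the quartiles count as outliers
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr
```

Applied as `iqr_outlier_bounds(train['checkout_price'])`, any price above the upper fence would be flagged as an outlier.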
In [20]:
# orders weekly trend
week_orders=train.groupby(["week"],as_index=False).sum()
week_orders=week_orders[["week","num_orders"]]

fig = px.line(week_orders, x="week", y="num_orders",
              markers=True,title="Order weekly trend")
fig.show()

Observation :

  1. There is no clear seasonality observed week over week. Weekly orders stay within the same range, with some ups and downs.

  2. An outlier is seen at week 62, which has the lowest total orders. However, it does not appear to be a data entry mistake.
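The "no clear seasonality" reading can be double-checked with a centered rolling mean over the weekly totals (a sketch, assuming `week_orders` as built above; `window=4` is an assumed smoothing width, not from the notebook):

```python
import pandas as pd

def add_rolling_mean(week_orders, window=4):
    # Centered rolling mean of weekly totals; a roughly flat line supports
    # the observation that there is no strong week-over-week seasonality
    out = week_orders.copy()
    out["rolling_mean"] = out["num_orders"].rolling(window, center=True).mean()
    return out
```

Plotting `rolling_mean` alongside `num_orders` would smooth out the week-to-week noise and expose any underlying trend.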

Feature Engineering

From the given data, we derive the features below to improve model performance.

Discount Percent : the percentage discount offered to the customer.

Discount Y/N : whether a discount is provided - 1 if there is a discount and 0 if there is not.

In [21]:
#Discount Percent
train['discount percent'] = ((train['base_price']-train['checkout_price'])/train['base_price'])*100
#Discount Y/N
train['discount y/n'] = [1 if x>0 else 0 for x in (train['base_price']-train['checkout_price'])]

# Creating same feature in test dataset
test['discount percent'] = ((test['base_price']-test['checkout_price'])/test['base_price'])*100
test['discount y/n'] = [1 if x>0 else 0 for x in (test['base_price']-test['checkout_price'])]
In [22]:
train.head(2)
Out[22]:
id week center_id meal_id checkout_price base_price emailer_for_promotion homepage_featured num_orders category cuisine city_code region_code center_type op_area discount percent discount y/n
0 1379560 1 55 1885 136.83 152.29 0 0 177 Beverages Thai 647 56 TYPE_C 2.0 10.151684 1
1 1466964 1 55 1993 136.83 135.83 0 0 270 Beverages Thai 647 56 TYPE_C 2.0 -0.736214 0
In [23]:
# Check for correlation between numeric features
plt.figure(figsize=(13,13))
sns.heatmap(train.corr(),linewidths=.1,cmap='Reds',annot=True)
plt.title('Correlation Matrix')
plt.show()

Observation:

  1. Base price and checkout price are highly correlated; we will drop base price during model feature selection.
  2. `id` carries no predictive information and will also be dropped from model feature selection.
  3. homepage_featured, checkout_price, emailer_for_promotion and discount percent show some correlation with the target.

Evaluation Metrics

The model evaluation metric is 100*RMSLE, where RMSLE is the Root Mean Squared Logarithmic Error across all entries in the test set.
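Written out, the metric is 100 * sqrt(mean((log(1 + pred) - log(1 + actual))^2)). A direct implementation (a sketch, equivalent to the sklearn call used later in the notebook):

```python
import numpy as np

def rmsle_100(y_true, y_pred):
    # 100 * Root Mean Squared Logarithmic Error; log1p handles zero counts safely
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100 * np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
```

Because the error is taken on the log scale, the metric penalizes relative (percentage) errors rather than absolute ones, which suits order counts that span several orders of magnitude.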

Model Development

  • One-hot encoding for categorical variables
  • Standardization of numerical variables
  • Build model
  • Prediction
In [24]:
#Define one-hot encoding function
def one_hot_encode(features_to_encode, dataset):
    # NOTE: the encoder is fit on whichever dataset is passed in, so the
    # train and test columns only align if both share the same category levels
    encoder = OneHotEncoder(sparse=False)
    encoder.fit(dataset[features_to_encode])

    encoded_cols = pd.DataFrame(encoder.transform(dataset[features_to_encode]),
                                columns=encoder.get_feature_names())
    dataset = dataset.drop(columns=features_to_encode)
    for col in encoded_cols.columns:
        dataset[col] = encoded_cols[col]
    return dataset
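Because `one_hot_encode` fits a fresh encoder on whichever dataset it receives, the train and test feature matrices only line up if both contain exactly the same category levels. A safer pattern (a sketch using `pandas.get_dummies`, not the notebook's approach) encodes both and reindexes test to the train layout:

```python
import pandas as pd

def encode_aligned(train_df, test_df, cols):
    # One-hot encode, then force the test columns into the train column
    # order; a level absent from test becomes an all-zero column instead
    # of silently shifting the feature matrix
    tr = pd.get_dummies(train_df, columns=cols)
    te = pd.get_dummies(test_df, columns=cols)
    te = te.reindex(columns=tr.columns, fill_value=0)
    return tr, te
```

Drop the target column from `train_df` before calling this, so it is not re-added to the test frame as an all-zero column.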
In [25]:
#get the list of categorical variables in the dataset
ls = train.select_dtypes(include='category').columns.values.tolist()

# Run one-hot encoding on all categorical variables

features_to_encode = ls
data = one_hot_encode(features_to_encode, train)
data = data.reset_index(drop = True)


# Train-Validation Data Split
y = data[["num_orders"]]
X= data.drop(["num_orders","id","base_price","discount y/n"],axis = 1)
X = X.replace((np.inf, -np.inf, np.nan), 0) # replace NaN and infinity values with 0

# 20% of train data is used for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=100)
In [26]:
# Prepare test data after applying one-hot encoding
OH_test = one_hot_encode(features_to_encode, test)
test_final= OH_test.drop(["id","base_price","discount y/n"],axis = 1)
In [27]:
#create pipeline for scaling and modeling

RF_pipe = make_pipeline(StandardScaler(),RandomForestRegressor(n_estimators = 100,max_depth = 7))
#Build Model
RF_pipe.fit(X_train, y_train.values.ravel())
#Predict Value
RF_train_y_pred = RF_pipe.predict(X_val)
In [28]:
# Model Evaluation- 

print('R Square:',RF_pipe.score(X_val, y_val))
print('RMSLE:', 100*np.sqrt(metrics.mean_squared_log_error(y_val, RF_train_y_pred)))
R Square: 0.6245024374709084
RMSLE: 85.57855787319313

Applying the model to predict orders

In [29]:
test_y_pred = RF_pipe.predict(test_final)
Result=pd.DataFrame(test_y_pred)
Result.values
Out[29]:
array([[279.41340907],
       [279.41340907],
       [279.41340907],
       ...,
       [ 60.02270367],
       [ 60.02270367],
       [ 60.02270367]])
In [30]:
Result=pd.DataFrame(test_y_pred)
Submission = pd.DataFrame(columns=['id', 'num_orders'])
Submission['id'] = test['id']
Submission['num_orders'] = Result.values

Submission.to_csv('output/submission_rfr.csv', index=False)
In [31]:
print(Submission.shape)
Submission.head()
(32573, 2)
Out[31]:
id num_orders
0 1028232 279.413409
1 1262649 279.413409
2 1453211 279.413409
3 1262599 279.413409
4 1495848 279.413409